AITopics

Country:

Europe (1.00)
North America > United States > Virginia (0.46)

Genre: Research Report > Experimental Study (1.00)

Industry:

Information Technology (0.93)
Health & Medicine (0.67)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Neural Information Processing SystemsFeb-10-2026, 12:58:57 GMT

d2a27e83d429f0dcae6b937cf440aeb1-Paper.pdf

proof step, prover, theorem, (16 more...)

Country:

North America > United States > North Carolina > Wake County > Morrisville (0.04)
North America > Canada (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.69)

Neural Information Processing SystemsAug-16-2025, 14:36:59 GMT

d2a27e83d429f0dcae6b937cf440aeb1-Supplemental.pdf

expression, proof step, theorem, (16 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Neural Information Processing SystemsAug-16-2025, 14:36:51 GMT

Learning to Prove Theorems by Learning to Generate Theorems

We consider the task of automated theorem proving, a key AI task.

proof step, prover, theorem, (15 more...)

Country:

North America > United States > North Carolina > Wake County > Morrisville (0.04)
North America > Canada (0.04)
Europe > Sweden > Stockholm > Stockholm (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Kripner, Matěj, Šustr, Michal, Straka, Milan

LeanTree: Accelerating White-Box Proof Search with Factorized States in Lean 4

arXiv.org Artificial IntelligenceJul-22-2025

Automated theorem proving (ATP) has been a classical problem in artificial intelligence since its inception, yet it remains challenging due to its vast state and action space. Large language models (LLMs) have recently emerged as a promising heuristic for ATP, but they lack correctness guarantees and thus require interaction with a proof verifier. Such interactions typically follow one of two approaches: black-box interaction, which does not utilize intermediate proof states, or white-box approaches, which allow for incremental proof construction and examination of intermediate states. While black-box approaches have directly benefited from recent LLM advances, white-box methods have comparatively lagged behind. In this paper, we address this gap by introducing LeanTree, which consists of (i) a tool built in the Lean 4 language that factorizes complex proof states into simpler, independent branches, and (ii) a dataset of these factorized intermediate states. Our white-box tooling offers several advantages over black-box approaches: it simplifies evaluation, reduces necessary context, generates richer training data, enables parallel search across multiple states, supports efficient reuse of states, and provides feedback in case of errors. Our preliminary results hint that white-box approaches outperform black-box alternatives in some settings.

large language model, logic & formal reasoning, machine learning, (17 more...)

2507.14722

Country: Europe (0.46)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial IntelligenceJul-11-2025

Frontier LLMs Still Struggle with Simple Reasoning Tasks

Malek, Alan, Ge, Jiawei, Lazic, Nevena, Jin, Chi, György, András, Szepesvári, Csaba

While state-of-the-art large language models (LLMs) demonstrate advanced reasoning capabilities-achieving remarkable performance on challenging competitive math and coding benchmarks-they also frequently fail on tasks that are easy for humans. This work studies the performance of frontier LLMs on a broad set of such "easy" reasoning problems. By extending previous work in the literature, we create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning, with changeable parameters (such as document length. or the number of variables in a math problem) that can arbitrarily increase the amount of computation required to produce the answer while preserving the fundamental difficulty. While previous work showed that traditional, non-thinking models can be made to fail on such problems, we demonstrate that even state-of-the-art thinking models consistently fail on such problems and for similar reasons (e.g. statistical shortcuts, errors in intermediate steps, and difficulties in processing long contexts). To further understand the behavior of the models, we introduce the unpuzzles dataset, a different "easy" benchmark consisting of trivialized versions of well-known math and logic puzzles. Interestingly, while modern LLMs excel at solving the original puzzles, they tend to fail on the trivialized versions, exhibiting several systematic failure patterns related to memorizing the originals. We show that this happens even if the models are otherwise able to solve problems with different descriptions but requiring the same logic. Our results highlight that out-of-distribution generalization is still problematic for frontier language models and the new generation of thinking models, even for simple reasoning tasks, and making tasks easier does not necessarily imply improved performance.

large language model, machine learning, natural language, (20 more...)

2507.07313

Country: North America > United States > Oklahoma (0.14)

Genre: Research Report > New Finding (0.87)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)
Consumer Products & Services > Travel (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)

arXiv.org Artificial IntelligenceJun-9-2025

Constructive Symbolic Reinforcement Learning via Intuitionistic Logic and Goal-Chaining Inference

Patrascu, Andrei T.

We introduce a novel learning and planning framework that replaces traditional reward-based optimisation with constructive logical inference. In our model, actions, transitions, and goals are represented as logical propositions, and decision-making proceeds by building constructive proofs under intuitionistic logic. This method ensures that state transitions and policies are accepted only when supported by verifiable preconditions -- eschewing probabilistic trial-and-error in favour of guaranteed logical validity. We implement a symbolic agent operating in a structured gridworld, where reaching a goal requires satisfying a chain of intermediate subgoals (e.g., collecting keys to open doors), each governed by logical constraints. Unlike conventional reinforcement learning agents, which require extensive exploration and suffer from unsafe or invalid transitions, our constructive agent builds a provably correct plan through goal chaining, condition tracking, and knowledge accumulation. Empirical comparison with Q-learning demonstrates that our method achieves perfect safety, interpretable behaviour, and efficient convergence with no invalid actions, highlighting its potential for safe planning, symbolic cognition, and trustworthy AI. This work presents a new direction for reinforcement learning grounded not in numeric optimisation, but in constructive logic and proof theory.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

2506.05422

Genre: Research Report (0.82)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial IntelligenceJun-2-2025

ProofNet++: A Neuro-Symbolic System for Formal Proof Verification with Self-Correction

Ambati, Murari

Table I presents the quantitative evaluation of ProofNet++ across three distinct datasets. The FPSR (Final Proof Success Rate) metric shows that the system performs best on the mathlib-extract dataset with a 74.9% success rate, followed by miniF2F at 68.4%, and the HOL Light Testbed trailing at 63.5%. Similarly, the PPC (Proof Production Correctness) values align with this trend, indicating higher intermediate proof accuracy on mathlib-extract (88.0%) compared to the other datasets. The EDPT (Edit Distance to Proof Target) metric reveals that mathlib-extract proofs require fewer correction steps (2.4) than miniF2F (3.2) and HOL Light (4.0), suggesting that the system is more efficient in approximating correct proofs in that domain. Latency measurements reflect verifier runtime, with mathlib-extract exhibiting the fastest average verification time (176 ms), whereas HOL Light has the highest latency (214 ms). Lastly, the average proof length varies notably, with HOL Light proofs being the longest (14.3 steps), potentially contributing to its higher latency and lower success metrics. These results indicate that while ProofNet++ demonstrates strong performance on established libraries like mathlib-extract, there is room for improvement on datasets with more complex or longer proofs, such as HOL Light. Enhancements could focus on optimizing proof search strategies and reducing verifier latency, particularly for longer proofs, to improve overall efficiency and success rates. E. Benchmark Pipeline Overview Figure 1 illustrates the full evaluation pipeline used to benchmark ProofNet++, from the initial input prompt to the final corrected proof output.

large language model, logic & formal reasoning, machine learning, (19 more...)

2505.2423

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Abdulla, Parosh Aziz, Atig, Mohamed Faouzi, Cailler, Julie, Liang, Chencheng, Rümmer, Philipp

Guiding Word Equation Solving using Graph Neural Networks (Extended Technical Report)

arXiv.org Artificial IntelligenceNov-19-2024

This paper proposes a Graph Neural Network-guided algorithm for solving word equations, based on the well-known Nielsen transformation for splitting equations. The algorithm iteratively rewrites the first terms of each side of an equation, giving rise to a tree-like search space. The choice of path at each split point of the tree significantly impacts solving time, motivating the use of Graph Neural Networks (GNNs) for efficient split decision-making. Split decisions are encoded as multi-classification tasks, and five graph representations of word equations are introduced to encode their structural information for GNNs. The algorithm is implemented as a solver named DragonLi. Experiments are conducted on artificial and real-world benchmarks. The algorithm performs particularly well on satisfiable problems. For single word \mbox{equations}, DragonLi can solve significantly more problems than well-established string solvers. For the conjunction of multiple word equations, DragonLi is competitive with state-of-the-art string solvers.

artificial intelligence, machine learning, word equation, (19 more...)

2411.15194

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.14)
North America > United States > New York > New York County > New York City (0.04)
Europe > Sweden > Uppsala County > Uppsala (0.04)
(9 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Dong, Kefan, Mahankali, Arvind, Ma, Tengyu

Formal Theorem Proving by Rewarding LLMs to Decompose Proofs Hierarchically

arXiv.org Artificial IntelligenceNov-4-2024

Mathematical theorem proving is an important testbed for large language models' deep and abstract reasoning capability. This paper focuses on improving LLMs' ability to write proofs in formal languages that permit automated proof verification/evaluation. Most previous results provide human-written lemmas to the theorem prover, which is an arguably oversimplified setting that does not sufficiently test the provers' planning and decomposition capabilities. Instead, we work in a more natural setup where the lemmas that are directly relevant to the theorem are not given to the theorem prover at test time. We design an RL-based training algorithm that encourages the model to decompose a theorem into lemmas, prove the lemmas, and then prove the theorem by using the lemmas. Our reward mechanism is inspired by how mathematicians train themselves: even if a theorem is too challenging to be proved by the current model, a positive reward is still given to the model for any correct and novel lemmas that are proposed and proved in this process. During training, our model proposes and proves lemmas that are not in the training dataset. In fact, these newly-proposed correct lemmas consist of 37.7% of the training replay buffer when we train on the dataset extracted from Archive of Formal Proofs (AFP). The model trained by our RL algorithm outperforms that trained by supervised finetuning, improving the pass rate from 40.8% to 45.5% on AFP test set, and from 36.5% to 39.5% on an out-of-distribution test set.

conditional proof, lemma, theorem, (14 more...)

2411.01829

Country:

South America > Brazil (0.04)
North America > United States > California > Santa Clara County > Palo Alto (0.04)
Europe > Germany > Berlin (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry: Health & Medicine (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)